The dataset chosen for the analysis is the following:
They analyzed gene expression profiling of lung tissue to define molecular pathway of COPD using recent RNA sequencing technology. Lung tissue was obtained from 98 COPD subjects and 91 subjects with normal spirometry. RNA isolated from these samples was processed with RNA-seq using HiSeq 2000. Gene expression measurements were calculated using Cufflinks software.
Chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide, causing 3.23 million deaths in 2019 (“Chronic Obstructive Pulmonary Disease (COPD) - World Health Organization” 2020). It is a common, preventable and treatable chronic lung disease which affects approximately 300 million men and women worldwide, mainly in low to middle-income countries (Ruvuna and Sood 2020).
As it develops, abnormalities in the small airways of the lungs lead to limitation of airflow in and out of the lungs. Several processes cause the airways to become narrow. There may be destruction of parts of the lung, mucus blocking the airways, and inflammation and swelling of the airway lining. It usually manifests itself through cough, wheezing and difficulty breathing.
The files were downloaded from recount2 using the study acession id “SRP041538”. The data was then scaled and the patients were divided into conditions reflecting whether or not they had COPD, with the reference being the individuals with normal spirometry (NOR). The samples are formatted according to this specific design and then filtered by average gene counts following a threshold adequate for the analysis pipeline used (Sha, Phan, and Wang 2015).
# Importing libraries
library(DiagrammeR)
library(recount)
library(DESeq2)
library(ggplot2)
library(tidyverse)
library(ggrepel)
library(DOSE)
library(DT)
require(biomaRt)
library(pheatmap)
library(fgsea)
library(reactome.db)
# Defining study acession
study_acession <- "SRP041538"
# Downloading data, if not already available
if (!file.exists(file.path(study_acession, "rse_gene.Rdata"))) {
download_study(study_acession)
}
load(file.path(study_acession, "rse_gene.Rdata"))
# Reading and labeling data
rse <- scale_counts(rse_gene)
rse$condition <- sapply(strsplit(as.character(rse$title), "-"), `[`, 2)
# Formatting dataset according to the specific design
dds <- DESeqDataSet(rse, ~condition)
dds$condition <- relevel(dds$condition, ref = "NOR")
# Filtering samples
keepers <- (rowSums(counts(dds)) / ncol(dds)) >= quantile((rowSums(counts(dds)) / ncol(dds)), .2)
dds <- dds[keepers, ]
We’ll perform the differential expression analysis using DESeq2. The referencial group is that of the individuals with normal spirometry (NOR) and the alpha error cuttoff shall be 0.05. Gene symbols are the way they’ll be referred to in the next plots and are HGNC symbols whenever possible (in the absence of an equivalent HGNC symbol, the standard EnsemblID shall be used). The genes exhibited in this following table are those statiscally significant and with an absolute fold change > 2:
dds <- DESeq(dds)
res <- as_tibble(results(dds, alpha = 0.05), rownames = "gene") %>%
rename(logFC = log2FoldChange) %>%
rename(FDR = padj)
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "www.ensembl.org")
genes <- gsub("\\..*", "", res$gene)
genemap <- getBM(
attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene_id"),
filters = "ensembl_gene_id",
values = genes,
mart = ensembl
)
check_local_symbol <- function(ref, local_symbols) {
if (ref %in% local_symbols$group_name) {
hgnc_symbol <- local_symbols$value[local_symbols$group_name == ref][1]
if (!hgnc_symbol == "") {
return(hgnc_symbol)
} else {
return(NA)
}
} else {
return(NA)
}
}
local_symbols <- data.frame(rowData(dds)$symbol)
local_symbols$value[is.na(local_symbols$value)] <- ""
gene_id_list <- NULL
gene_id_lister <- function(ref, genemap) {
if (ref %in% genemap$ensembl_gene_id) {
hgnc_symbol <- genemap$hgnc_symbol[genemap$ensembl_gene_id == ref][1]
if (!hgnc_symbol == "") {
return(hgnc_symbol)
} else {
ref <- check_local_symbol(ref, local_symbols)
return(ref)
}
} else {
ref <- check_local_symbol(ref, local_symbols)
return(ref)
}
}
gene_id_list <- sapply(genes, gene_id_lister, genemap)
res <- res %>%
add_column(symbol = gene_id_list) %>%
relocate(symbol, .after = gene)
deseq_genes <- res %>%
as.data.frame() %>%
filter(FDR < 0.05) %>%
dplyr::select(EnsemblID = gene, symbol, logFC, pvalue, FDR) %>%
arrange(FDR)
deseq_genes_count <- nrow(deseq_genes)
output_note <- paste0("\nTotal differential expressed genes with adjusted p<0.05: ", deseq_genes_count, "\n(top 1000 shown above)")
deseq_genes %>%
datatable()
tdds <- vst(dds, blind = T)
pcaData <- plotPCA(tdds, intgroup = "condition", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = group)) +
geom_point(size = 3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed() +
ggtitle("PCA plot of COPD patients x Healthy individuals")
res <- res %>%
mutate(
Expression = case_when(
logFC >= log2(2) & FDR <= 0.05 ~ "Up-regulated",
logFC <= -log2(2) & FDR <= 0.05 ~ "Down-regulated",
TRUE ~ "Unchanged"
)
) %>%
mutate(
Significance = case_when(
abs(logFC) >= log2(2) & FDR <= 0.05 & FDR > 0.01 ~ "FDR 0.05",
abs(logFC) >= log2(2) & FDR <= 0.01 & FDR > 0.001 ~ "FDR 0.01",
abs(logFC) >= log2(2) & FDR <= 0.001 ~ "FDR 0.001",
TRUE ~ "Unchanged"
)
)
top_bottom <- function(top, res) {
top_genes <- bind_rows(
res %>%
filter(Expression == "Up-regulated") %>%
arrange(FDR, desc(abs(logFC))) %>%
head(top),
res %>%
filter(Expression == "Down-regulated") %>%
arrange(FDR, desc(abs(logFC))) %>%
head(top)
)
return(top_genes)
}
volcano_plot <- ggplot(res, aes(logFC, -log(FDR, 10))) +
geom_point(aes(color = Significance), size = 2 / 5) +
xlab(expression("log"[2] * "(Fold change)")) +
ylab(expression("-log"[10] * "FDR")) +
scale_color_viridis_d(option = "B") +
guides(colour = guide_legend(override.aes = list(size = 1.5))) +
geom_vline(xintercept = c(-1, 1), col = "red") +
geom_hline(yintercept = -log10(0.05), col = "red") +
ggtitle("Lung tissue transcriptome of COPD patients x Healthy individuals") +
geom_label_repel(
data = top_bottom(40, res),
mapping = aes(logFC, -log(FDR, 10), label = symbol),
size = 2,
max.overlaps = Inf
)
volcano_plot
tddsa <- assay(tdds)
rownames(tddsa) <- gsub("\\..*", "", row.names(tddsa))
filtered_res <- res %>%
filter(!is.na(symbol))
heat_genes <- top_bottom(10, filtered_res)
heat_genes$gene <- gsub("\\..*", "", heat_genes$gene)
topDE <- tddsa[heat_genes$gene, ]
rownames(topDE) <- heat_genes$symbol
col_annotation <- as.data.frame(colData(tdds)[, "condition"])
colnames(col_annotation) <- "Condition"
rownames(col_annotation) <- colnames(topDE)
row_annotation <- as.data.frame(heat_genes$Expression)
colnames(row_annotation) <- "Expression"
rownames(row_annotation) <- heat_genes$symbol
pheatmap(topDE,
scale = "row",
clustering_distance_rows = "correlation",
annotation_col = col_annotation,
annotation_row = row_annotation,
show_colnames = FALSE,
annotation_names_row = FALSE,
annotation_names_col = FALSE,
main = "20 genes with the strongest differential expression"
)
genemap$entrezgene_id[is.null(genemap$entrezgene_id)] <- NA
entrez_id_lister <- function(ref, genemap) {
if (ref %in% genemap$ensembl_gene_id) {
entrez_id <- genemap$entrezgene_id[genemap$ensembl_gene_id == ref][1]
if (!is.na(entrez_id)) {
return(entrez_id)
} else {
ref <- check_local_symbol(ref, local_symbols)
return(NA)
}
} else {
ref <- check_local_symbol(ref, local_symbols)
return(NA)
}
}
entrez_id_list <- sapply(genes, entrez_id_lister, genemap)
res$entrez_id <- as.character(entrez_id_list)
gsea_data <- subset(res, !is.na(stat)) %>%
subset(!is.na(FDR))
pathways <- reactomePathways(gsea_data$entrez_id)
ranked_genes <- gsea_data$stat
names(ranked_genes) <- gsea_data$entrez_id
fgseaRes <- fgsea(pathways, ranked_genes, maxSize = 500)
topPathwaysUp <- fgseaRes[NES > 0][head(order(padj), n = 10), pathway]
topPathwaysDown <- fgseaRes[NES < 0][head(order(padj), n = 10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(pathways[topPathways], ranked_genes, fgseaRes,
gseaParam = 0.5
)
COPD has two main components in its pathophysiological presentation: emphysema and chronic bronchitis. They are shown in the following flowchart:
DiagrammeR::grViz("digraph flowchart {
# node definitions with substituted label text
node [fontname = Helvetica, shape = rectangle]
tab1 [label = '@@1']
tab2 [label = '@@2']
tab3 [label = '@@3']
tab4 [label = '@@4']
tab5 [label = '@@5']
tab6 [label = '@@6']
tab7 [label = '@@7']
tab8 [label = '@@8']
tab9 [label = '@@9']
tab10 [label = '@@10']
tab11 [label = '@@11']
tab12 [label = '@@12', style= 'filled', fillcolor = '#E9967A']
tab13 [label = '@@13', style= 'filled', fillcolor = '#E9967A']
tab14 [label = '@@14']
tab15 [label = '@@15']
# edge definitions with the node IDs
tab15 -> tab1 -> {tab2 tab4 tab6};
tab2 -> {tab3 tab4};
tab3 -> {tab8 tab9 tab10}
{tab8 tab9 tab10} -> tab12;
tab4 -> {tab5 tab7} -> tab11;
tab6 -> tab11;
tab4 -> tab3;
tab11-> tab13;
tab14 -> tab2;
}
[1]: 'Inflammation'
[2]: 'Proteinase–antiproteinase imbalance'
[3]: 'Destruction of alveolar walls and capillaries'
[4]: 'Oxidative stress and inflamatory mediators'
[5]: 'Fibrosis and thickening of bronchiolar walls'
[6]: 'Mucus hypersecretion and ciliary dysfunction'
[7]: 'Edema and smooth muscle contration in the airways'
[8]: 'Enlarged air spaces'
[9]: 'Impaired gas diffusion'
[10]: 'Air trapping on expiration'
[11]: 'Narrowing of small vessels'
[12]: 'Emphysema'
[13]: 'Chronic bronchitis'
[14]: 'Serpin genetic deficiencies'
[15]: 'Environmental expositions'
")
I’ll present the specific findings of the analysis in terms of how they relate to the course of development of the disease.
Environmental exposures such as smoking, secondhand smoke exposure, air pollution and occupational exposure to other dusts, fumes and chemicals are the leading risk factors for the development of COPD (Ruvuna and Sood 2020). They irritate the lungs, leading to an inflammation driven mainly by neutrophils and lymphocytes.
When recruited to the bronchioli and alveoli, neutrophils release elastases, enzymes that degrade the elastin in the alveolar walls and capillaries of the lungs. Neutrophils also produce reactive oxygen species (ROS), inflammatory mediators and cytokines that promote metalloproteinase (MMPs) activity which is associated with the destruction of the extracellular matrix (ECM) components and the aberrant remodelling of damaged alveoli while inhibiting tissue inhibitors of metalloproteinases (TIMPs) activity which is associated to the maintenance of ECM integrity. This constitutes the classic protease-antiprotease imbalance that leads to emphysema by the destruction of alveolar walls and capillaries with an enlargement of air spaces, impaired gas diffusion and air trapping on expiration (Laurell and Eriksson 2013).
The neutrophil pathologic role in COPD highly depends on the activity of the CXCL-8-CXCR1/2 axis (Ha, Debnath, and Neamati 2017), which was found to be upregulated in patients with the disease. Other genes in the same axis, such as CXCL12, IL1B, RELA and EGFR are also upregulated. As for the metalloproteinases, MMP1, MMP2, MMP10, MMP19 and MMP25 were upregulated. No TIMP was found to be downregulated. On the contrary, TIMP4 was upregulated. In the GEAS plot, we can see the comparatively low enrichment in COPD patients of genes related to the maintenance of the extracellular matrix organisation. This represents the smaller expression of antiproteinases and other constitutive elements that are key to the physiological maintenance of the ECM.
The second of the main pillars for the presentation of COPD is chronic bronchitis. The whole inflammatory environment at the lungs leads to mucus hypersecretion, ciliary dysfunction, fibrosis, thickening of the bronchiolar walls, oedema and smooth muscle contraction in the airways, producing a narrowing of the small airways that characterizes COPD’s chronic bronchitis.
We can observe the increase in mucin expression in COPD patients. From the nine main human mucin genes, we find that MUC4, MUC5B and MUC5AC are upregulated. MUC5AC is the main gene previously reported as overexpressed in COPD patients’ sputum (Rogers 2008).
A less well-known dimension of the pathophysiology of COPD is its associated mitochondrial dysfunction. When faced with oxidative stress, the lung mitochondria present a decrease in membrane potential and presented abnormal expression regulation (Wiegman et al. 2015).
The GEAS analysis also demonstrates that many mitochondria-related pathways are altered in COPD patients. Prohibitin (PHB-1) has an important role in mtDNA regulation by stabilizing TFAM protein, a known protein component of the mitochondrial nucleoids (Kasashima et al. 2008). In this analysis, PHB-1 was found to be down-regulated and TFAM up-regulated in COPD patients, a combination that can explain at least partially the mitochondrial dysfunction. This increases the inflammatory response even more.